Fully semantic segmentation for rectal cancer based on post-nCRT MRI modality and deep learning framework

Purpose: Rectal tumor segmentation on post neoadjuvant chemoradiotherapy (nCRT) magnetic resonance imaging (MRI) has great significance for tumor measurement, radiomics analysis, treatment planning, and operative strategy. In this study, we developed and evaluated segmentation potential exclusively on post-chemoradiation T2-weighted MRI using convolutional neural networks, with the aim of reducing the detection workload for radiologists and clinicians.
Methods: A total of 372 consecutive patients with LARC were retrospectively enrolled from October 2015 to December 2017. The standard-of-care neoadjuvant process included 22-fraction intensity-modulated radiation therapy and oral capecitabine. Further, 243 patients (3061 slices) were grouped into training and validation datasets with a random 80:20 split, and 41 patients (408 slices) were used as the test dataset. A symmetric eight-layer deep network was developed using the nnU-Net framework, which outputs the segmentation result with the same size. The trained deep learning (DL) network was examined using fivefold cross-validation and tumor lesions with different TRGs.
Results: At the stage of testing, the Dice similarity coefficient (DSC), 95% Hausdorff distance (HD95), and mean surface distance (MSD) were applied to quantitatively evaluate the performance of generalization. Considering the test dataset (41 patients, 408 slices), the average DSC, HD95, and MSD were 0.700 (95% CI: 0.680–0.720), 17.73 mm (95% CI: 16.08–19.39), and 3.11 mm (95% CI: 2.67–3.56), respectively. Eighty-two percent of the MSD values were less than 5 mm, and fifty-five percent were less than 2 mm (median 1.62 mm, minimum 0.07 mm).
Conclusions: The experimental results indicated that the constructed pipeline could achieve relatively high accuracy. Future work will focus on assessing the performance with multicentre external validation.


Introduction
Colorectal cancer is the fourth most common cancer worldwide, with an annual incidence of more than 700,000 cases and the third-highest mortality rate [1]. According to the main international clinical guidelines [2,3], the recommended treatment for locally advanced rectal cancer (LARC) is neoadjuvant chemoradiotherapy (nCRT) followed by total mesorectal excision (TME). In recent years, the watch-and-wait strategy has appeared to be a safer option for patients who achieve a pathologic complete response (pCR) after nCRT [4], while local excision, including transanal excision, transanal endoscopic microsurgery, and transanal minimally invasive surgery, may be suitable for patients with a good response [5,6]. At the same time, patients may also develop liver and pulmonary metastases [7,8]. Therefore, accurate response prediction is essential in planning optimal treatment strategies [9][10][11].
As recommended by the guidelines, response assessment should be performed with a combination of restaging magnetic resonance imaging (MRI), digital rectal examination, and endoscopy, in which MRI plays an important role [12,13]. However, the basic step for prediction is to accurately identify the residual tumor region or the tumor bed [14]. In general, this region is delineated manually by radiologists using medical software, which is labor intensive and time-consuming [15]. As the essential modality for rectal cancer, T2-weighted imaging (T2WI) can display anatomical information with a clearer tumor boundary owing to its high spatial resolution [16,17]. Theoretically, patients undergo MRI scanning before and after therapy to obtain baseline MRI (pre-nCRT MRI) and post-nCRT MRI [18]. Although pre-nCRT MRI is an important reference, its availability and accessibility are limited in real clinical practice. When detection tasks are based only on post-nCRT MRI images, the nCRT-induced submucosal edema, fibrosis, and/or mucin production make it difficult to distinguish post-treatment changes from the residual tumor [19]. Meanwhile, the pathological changes induced by nCRT make the tumor appearance differ from that of the primary tumor across tumor regression grades (TRGs) [20]. Some unsatisfactory and inaccurate results for restaging using standard manual MRI protocols [21] led to the need for a separate evaluation system for post-nCRT imaging. Currently, only a few studies have used post-nCRT MRI for segmentation and prediction [22][23][24], and most are not based on the direct segmentation of lesions. Semantic segmentation for rectal cancer using the nnUNet framework [25][26][27] and a single post-nCRT MRI modality has never been reported. The most commonly used imaging modality in previous research is computed tomography (CT) of the colon [28][29][30].
In this study, we explored and examined the segmentation potential for LARC exclusively on post-chemoradiation T2-weighted MRI using state-of-the-art deep learning (DL) architectures, with the aim of providing clinical auto-delineation tools for subsequent measurement and analysis [31][32][33]. Meanwhile, the generalization performance was further validated on tumor lesions with different TRGs. The quantitative metrics [34,35], including the Dice similarity coefficient (DSC), 95% Hausdorff distance (HD95), and mean surface distance (MSD), confirmed the practical implications of reducing the workload for both colorectal cancer physicians and radiologists.

Patients and dataset
The retrospective study enrolled 372 consecutive patients with LARC from October 2015 to December 2017. The inclusion criteria were as follows:
(1) All candidates were pathologically confirmed with locally advanced rectal adenocarcinoma (excluding mucinous adenocarcinoma).
(2) All candidates received a complete and standard nCRT process, which included 22-fraction intensity-modulated radiation therapy and oral capecitabine of 825 mg/m² twice per day.
(3) All candidates were scanned by T2-weighted MRI within 1 week before nCRT.
(4) All candidates were scanned by T2-weighted MRI within 1 week before TME surgery.
(5) All candidates were clinically confirmed to be in T3, T4, or N+ stage using baseline MRI.
The clinical protocol was approved by the medical ethics committee of Beijing Cancer Hospital. Following the process shown in Fig. 1, the overall dataset was produced, containing rectal cancer images from 284 patients. It was then artificially grouped into a training and validation dataset (N = 243) and a test dataset (N = 41).

MRI scan, image acquisition, and data preprocessing
All the post-nCRT MRI images were obtained with a 3.0-T MRI scanner (Discovery MR750; GE Healthcare, WI, USA). To minimize colonic motility, 20 mg of scopolamine butylbromide was administered intramuscularly to each patient 30 minutes before the MRI scan. A conventional rectal MRI protocol was applied to all patients; the standard process mainly included high-resolution T2WI in the axial, coronal, and sagittal planes, with diffusion-weighted imaging (DWI) as an auxiliary reference for subsequent delineation. The main scan parameters are as follows:
For the preprocessing steps, each volume was initially resampled to a consistent spatial resolution of 0.3516 × 0.3516 × 3.3 mm³ to ensure a uniform physical distance interpretation across the acquired 3D images. The number of layers per patient ranged from 18 to 40, with the same image size of 512 × 512 pixels. Then a total of 284 3D images were converted into 2D images using the SimpleITK package, and 3469 slices containing tumor lesions were screened to create the whole dataset. Each finished slice was stored in the NIfTI format (i.e., with the .nii.gz extension). To attain a standard normal distribution of image intensities, z-scores were utilized for the normalization (μ ± σ) of all the generated slices. At the final stage, 3061 slices were used to train the model, with a random 20% held out as an internal validation set. Further, 408 slices were not involved in the model-building process and were reserved for independent external validation.
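As a rough illustration of the slice screening and z-score normalization described above, the following sketch keeps only slices whose mask contains tumor pixels and rescales each kept slice to zero mean and unit standard deviation. It uses only the Python standard library, with nested lists standing in for SimpleITK images. The function names (`zscore_normalize`, `screen_tumor_slices`) and the toy intensities are hypothetical, and normalizing per slice rather than per volume is an assumption, not a detail confirmed by the text.

```python
import statistics

def zscore_normalize(slice_pixels):
    """Rescale one 2D slice to zero mean and unit standard deviation."""
    flat = [v for row in slice_pixels for v in row]
    mu = statistics.fmean(flat)
    sigma = statistics.pstdev(flat)
    if sigma == 0:
        # A constant slice carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in slice_pixels]
    return [[(v - mu) / sigma for v in row] for row in slice_pixels]

def screen_tumor_slices(volume, mask_volume):
    """Keep (and normalize) only slices whose mask has at least one tumor pixel."""
    kept = []
    for image_slice, mask_slice in zip(volume, mask_volume):
        if any(any(row) for row in mask_slice):
            kept.append(zscore_normalize(image_slice))
    return kept

# Toy volume with two 2x2 slices; only the first slice contains a labeled pixel.
volume = [[[100.0, 120.0], [140.0, 160.0]],
          [[90.0, 95.0], [100.0, 105.0]]]
masks = [[[0, 1], [0, 0]],
         [[0, 0], [0, 0]]]
kept = screen_tumor_slices(volume, masks)
```

In this toy case only one of the two slices survives the screening, mirroring how 3469 tumor-containing slices were retained from the 284 volumes.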

ROI delineation and manual annotation
The regions of interest (ROIs) on post-nCRT T2-weighted images were independently delineated by two experienced radiologists with 8 and 10 years of experience in abdominal radiology. The ROIs were defined as all the residual tumors and suspected fibrotic areas. The lesion area on each slice was drawn along the tumor contour using ITK-SNAP v3.8.0 software. All the controversial images were reviewed by a third radiologist, and an agreement was reached if inconsistency existed in the judgment of tumor boundary details. The ROIs were created manually on T2-weighted images; the readers also referred to DWI images to avoid false positives and false negatives as far as possible.
After complete nCRT treatment and TME, surgically resected specimens were evaluated by two experienced pathologists with 10 and 15 years of experience in gastrointestinal disease, respectively. The TRG annotations followed the National Comprehensive Cancer Network and American Joint Committee on Cancer TRG system [36]. As shown in Fig. 2, the TRG indicator was divided into four levels (TRG0, TRG1, TRG2, and TRG3), and patients with TRG1, TRG2, and TRG3 were considered during model training and testing.

Model construction: nnUNet framework for rectal tumor segmentation
nnUNet (https://github.com/MIC-DKFZ/nnUNet) is a general adaptive segmentation framework proven to have strong performance on 10 public datasets in international biomedical segmentation competitions (Liver Tumor, Brain Tumor, Hippocampus, Lung Tumor, Prostate, Cardiac, Pancreas Tumor, Colon Cancer, Hepatic Vessels, and Spleen) [25]. Regarding colorectal cancer segmentation specifically, 190 CT images of colon cancer [37] were used in the Medical Segmentation Decathlon (Memorial Sloan Kettering Cancer Center). However, the framework has not yet been widely applied to MRI images of rectal tumors. As demonstrated in Fig. 3, the segmentation pipeline comprised four major stages, namely preprocessing, data augmentation, model training, and post-processing, and was capable of automatic network configuration.
In more detail, the overall segmentation network structure was symmetrically composed of eight layers, as shown in Fig. 4, extracting and reassembling features through network structure and parameter configuration. [...] changed similarly, and the feature fusion was performed with the skip layers. At the end, a 1 × 1 convolution and a softmax layer were applied, generating the predicted ROI results. Our source code is available via GitHub (https://github.com/Post-nCRT/Segmentation-of-rectal-cancer) and can be used together with the nnUNet code.
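To make the final stage concrete, the sketch below shows how a softmax over per-pixel class scores can be turned into a predicted ROI mask by picking the most probable class. It is a minimal standard-library illustration, not the nnUNet implementation: the two-class layout (index 0 for background, index 1 for tumor), the function names, and the toy scores are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_mask(score_map):
    """score_map[row][col] holds [background_score, tumor_score] for one pixel."""
    mask = []
    for row in score_map:
        mask_row = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            # Label the pixel with the class of highest probability.
            mask_row.append(probs.index(max(probs)))
        mask.append(mask_row)
    return mask

# Toy 2x2 score map, as might be produced by the last 1 x 1 convolution.
toy_scores = [
    [[2.0, 0.5], [0.1, 1.7]],
    [[1.2, 1.1], [-0.3, 0.4]],
]
toy_mask = predict_mask(toy_scores)
```

Because softmax preserves the ordering of the scores, the mask is the same as taking the arg-max of the raw scores; the probabilities matter mainly for the loss during training.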

Evaluation
We calculated the most commonly used metrics from the prediction results and the radiologists' gold-standard annotations to quantitatively evaluate the performance of the DL model. DSC, Jaccard, Recall, Precision, and F1-score were used to measure the performance in the training stage, and DSC, HD95, and MSD were applied as the main indexes to examine the test dataset [38]. The formulas are expressed as follows:
① DSC (Dice similarity coefficient): DSC is usually used to calculate the volume overlap between two sets, with a value range of [0,1]:
$$\mathrm{DSC}(M, N) = \frac{2\,|M \cap N|}{|M| + |N|} \tag{1}$$
where M ∩ N represents the intersection of the ground truth (N) and prediction (M), and | | represents the number of elements.
② Jaccard (Jaccard similarity coefficient): Given two sets M and N, the Jaccard coefficient is defined as the ratio of the intersection of M and N to the union of M and N:
$$\mathrm{Jaccard}(M, N) = \frac{|M \cap N|}{|M \cup N|} \tag{2}$$
③ Recall (R): Recall is defined as the proportion of true-positive samples detected among all positive samples. Its value is equivalent to sensitivity:
$$R = \frac{TP}{TP + FN} \tag{3}$$
④ Precision (P): Precision essentially measures the proportion of true-positive samples among all samples predicted to be positive:
$$P = \frac{TP}{TP + FP} \tag{4}$$
where TP, FP, and FN denote the numbers of true-positive, false-positive, and false-negative samples, respectively.
⑤ F1-score: The F_β-score considers precision and recall together,
$$F_\beta = \frac{(1 + \beta^2)\,P\,R}{\beta^2 P + R} \tag{5}$$
and the F1-score (β = 1) is the harmonic mean of precision and recall, which can be expressed as Eq. 6:
$$F_1 = \frac{2\,P\,R}{P + R} \tag{6}$$
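The overlap metrics above can be checked on a small example. The following sketch counts true positives, false positives, and false negatives pixel by pixel on two binary masks and derives DSC, Jaccard, recall, precision, and F1 from those counts. The function name `overlap_metrics` and the toy masks are hypothetical; returning 0.0 when a denominator is zero is a convention chosen for this sketch only.

```python
def overlap_metrics(pred, truth):
    """pred and truth are same-shape 2D lists of 0/1 labels (1 = tumor)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1:
                fp += 1
            elif t == 1:
                fn += 1

    def ratio(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": ratio(tp, tp + fp + fn),
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Toy masks: 3 overlapping pixels, 1 false positive, 1 false negative.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
truth = [[0, 0, 1, 1],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
metrics = overlap_metrics(pred, truth)
```

With these masks the DSC is 0.75 and the Jaccard coefficient is 0.6. For binary masks the DSC and the F1-score coincide, since both reduce to 2TP / (2TP + FP + FN).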
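⑥ HD95 (95% Hausdorff distance): HD95 mainly measures the maximum distance between the ground truth (N) and prediction (M):
$$\mathrm{HD95}(M, N) = \max\{\, hd(M, N),\; hd(N, M) \,\}, \qquad hd(M, N) = \underset{m \in M}{K_{95\%}} \; \min_{n \in N} \lVert m - n \rVert \tag{7}$$
where hd(M, N) and hd(N, M) are the unidirectional Hausdorff distances from set M to set N and from set N to set M, respectively, and K_{95%} represents the 95th percentile.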
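⑦ MSD (Mean surface distance): MSD mainly measures the mean distance between the two surfaces:
$$\mathrm{MSD}(M, N) = \frac{1}{|S(M)| + |S(N)|} \left( \sum_{v \in S(M)} d(v, S(N)) + \sum_{v \in S(N)} d(v, S(M)) \right) \tag{8}$$
where S(K) denotes the surface of set K and d(v, S(K)) denotes the shortest distance from an arbitrary voxel v to S(K).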
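The two distance metrics can be illustrated on small point sets. The sketch below treats each surface as a list of 2D coordinates in millimetres, computes the shortest distance from every point to the other surface, and then takes the larger of the two directed 95th percentiles (HD95) and the pooled mean (MSD). This is an illustrative standard-library version under stated assumptions: the function names and toy contours are hypothetical, and the percentile uses linear interpolation between order statistics, which is only one of several conventions and may differ from the one used in the study.

```python
import math

def directed_distances(src, dst):
    """Shortest Euclidean distance from every point in src to the set dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def hd95(surface_m, surface_n):
    """Larger of the two directed 95th-percentile surface distances."""
    return max(percentile(directed_distances(surface_m, surface_n), 95),
               percentile(directed_distances(surface_n, surface_m), 95))

def msd(surface_m, surface_n):
    """Mean of all point-to-surface distances, pooled over both directions."""
    d_mn = directed_distances(surface_m, surface_n)
    d_nm = directed_distances(surface_n, surface_m)
    return (sum(d_mn) + sum(d_nm)) / (len(d_mn) + len(d_nm))

# Toy contours in mm: the prediction has one outlying point at (4, 3).
truth_surface = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
pred_surface = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (4.0, 3.0)]
toy_hd95 = hd95(pred_surface, truth_surface)
toy_msd = msd(pred_surface, truth_surface)
```

The single outlying point dominates HD95 but shifts MSD only modestly, which is consistent with the gap between the reported HD95 (17.73 mm) and MSD (3.11 mm) on the test dataset.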
⑧ ICC (Intraclass correlation coefficient): ICC is applied to evaluate the reliability between multiple measurements of the same object:
$$\mathrm{ICC} = \frac{MS_{\mathrm{group}} - MS_{\mathrm{error}}}{MS_{\mathrm{group}} + (U - 1)\, MS_{\mathrm{error}}} \tag{9}$$
where MS_group and MS_error represent the mean squares of group and error, respectively, and U is defined as the number of measurements.
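As a worked illustration of the ICC, the sketch below takes ICC as (MS_group − MS_error) / (MS_group + (U − 1) · MS_error) and obtains the two mean squares from a one-way analysis of variance, with one row per lesion and one column per reader. The one-way model is an assumption made for this example; the ICC variant selected in SPSS for the study is not stated in the text, and the function name and the toy diameters are hypothetical.

```python
def icc_one_way(ratings):
    """ratings[i][j] is the j-th measurement of object i (same U for every object)."""
    n_objects = len(ratings)
    u = len(ratings[0])  # number of measurements per object
    row_means = [sum(row) / u for row in ratings]
    grand_mean = sum(row_means) / n_objects

    # Between-object (group) mean square.
    ms_group = u * sum((m - grand_mean) ** 2 for m in row_means) / (n_objects - 1)
    # Within-object (error) mean square.
    ss_error = sum((x - m) ** 2 for row, m in zip(ratings, row_means) for x in row)
    ms_error = ss_error / (n_objects * (u - 1))

    return (ms_group - ms_error) / (ms_group + (u - 1) * ms_error)

# Toy maximum diameters (mm) of four lesions measured by two readers.
diameters = [[10.0, 11.0],
             [20.0, 19.0],
             [30.0, 32.0],
             [40.0, 41.0]]
toy_icc = icc_one_way(diameters)
```

Here the readers differ by at most 2 mm on lesions spanning 10 to 41 mm, so the between-lesion variance dominates and the ICC is close to 1.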

Clinical characteristics of patients with LARC
A total of 372 patients with LARC were selected as preliminary candidates, and 284 patients (243 in the training cohort, mean age 56.37 ± 9.83 years; 41 in the test cohort, mean age 55.59 ± 11.66 years) were eventually enrolled in the study. The clinical characteristics of patients in the training and test cohorts, including number of MRI slices, age, sex, and TRG levels, are summarized in Table 1.
The learning curves of the first to fifth folds are depicted in Fig. 5. The changes in training and validation losses were measured using the scale of the left axis, and DSC values on the validation dataset were visualized using the right axis. From 0 to 200 epochs, the DSC values increased smoothly and then gradually stabilized at 0.88 after 200 epochs.
The examples of segmentation results were compared with the original images and segmentation output, as [...] results and ground truth on different TRG levels. It was evident that the values of HD95 and MSD generally increased from TRG1 to TRG3, disregarding the potential decrease in HD95 caused by the larger number of training and testing slices for TRG2. The rise of the two metrics indicated that the segmentation of tumor surface boundaries became more challenging as the degree of tumor regression increased after nCRT, which was also consistent with practical experience in manual delineation. Fig. 7(a-l) visually shows the segmentation examples of tumor lesions; each TRG level is illustrated with two cases, comparing the prediction results of the DL model with the ground truth from radiologists.
Furthermore, statistical analyses were conducted to provide more adequate comparability of DSC. The intraclass correlation coefficient (ICC) of the representative radiomics feature, the maximum diameter, was computed using pyradiomics 3.0.1 and SPSS Statistics 27.0 (IBM official version). In Table 4, the ICC between expert readers, the ICC between expert readers and the deep learning model, and the ICC reported in previous literature [24] are summarized together to provide a quantitative explanation of the difficulty of rectal tumor segmentation on post-nCRT MRI. The ICC between the lesion areas delineated by radiologists and those predicted by the deep learning model was 0.669 (95% CI: 0.612, 0.719), compared with an inter-reader agreement on T2 images between the two human radiologists of 0.739 (95% CI: 0.515, 0.865).
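For readers unfamiliar with fivefold cross-validation, the sketch below shows one way to partition the 243 training patients into five folds so that each patient is used for validation exactly once and each fold trains on roughly 80% of the patients. nnUNet generates its own splits internally, so this is only an illustration; the function name, the fixed seed, and splitting at the patient level (rather than the slice level) are assumptions.

```python
import random

def five_fold_splits(patient_ids, n_folds=5, seed=0):
    """Yield (train_ids, val_ids) pairs; every patient is validated exactly once."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    # Deal the shuffled ids into n_folds groups of near-equal size.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_ids = folds[k]
        train_ids = [pid for j, fold in enumerate(folds) if j != k for pid in fold]
        yield train_ids, val_ids

splits = list(five_fold_splits(range(243)))
fold_sizes = [len(val_ids) for _, val_ids in splits]
```

Splitting by patient keeps all slices of one patient on the same side of the split, which avoids near-duplicate neighboring slices appearing in both the training and validation sets.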

Discussion
The automatic segmentation of rectal tumors on post-nCRT MRI makes a positive contribution to the evaluation of the nCRT effect and is also the cornerstone of the subsequent processes, including tumor measurement, radiomics analysis, and surgical planning. When only post-nCRT MRI images are available, it is particularly critical to ensure the reliability and accuracy of segmentation results. A high probability exists that confounding factors would be introduced if the images of patients with pCR (theoretically accounting for 20%) were directly sent to the segmentation model for training [11]. The clinicians could neither delineate the ROI nor equate it with a completely tumor-free region. Thus, the patients with pCR were first excluded, and only patients without pCR (243 with 3061 slices) were used to construct the segmentation model.
Previous studies based only on post-nCRT MRI images involved segmentation of the rectal wall or suspicious areas on post-nCRT images [22][23][24]; they were not directly related to the segmentation of tumor areas. Thomas et al. [22] trained a fully convolutional network for the segmentation of the rectal wall on post-chemoradiation T2-weighted MRI, and the median DSC reached 0.680. Pang et al. [23] employed both U-Net and 4-channel U-Net for "suspicious region" segmentation for follow-up radiomics analysis, achieving DSC values of 0.656 (95% CI: 0.630-0.683) and 0.660 (95% CI: 0.628-0.691), respectively. Meanwhile, compared with the manual method, the trained DL model showed better performance than either automated or semiautomated segmentation using the software, which achieved DSCs of 0.420 ± 0.230 (ICC: 0.530-0.660) and 0.410 ± 0.220 (ICC: 0.610-0.750) [24], respectively.
Although relatively stable results were obtained in this study, it still has some limitations that call for future improvement and optimization. From the perspective of the dataset, we could recruit as many patients as possible at each TRG grade to ensure a more balanced sample distribution across TRGs. Additionally, the DL model trained on the retrospective dataset could be further validated on a prospective multicenter dataset. In light of the diminishing likelihood of obtaining validation through anatomopathological reports due to the increasing use of the watch-and-wait protocol and the option of local excision [39,40], future efforts will focus on exploring weakly supervised or unsupervised artificial intelligence approaches for scenarios with few pathological labels [41][42][43]. Tissue specimens from appropriate patients undergoing local excision can also be obtained for pathologic study, with fewer differences from the patients who undergo TME surgery.
Considering improvements solely from the perspective of model development and imaging technology, introducing multi-stage segmentation steps or attention mechanisms may increase the segmentation accuracy. Furthermore, the suitable integration of 2D and 3D models [44,45] in diverse clinical scenarios is a desirable research direction. As post-nCRT imaging techniques for rectal cancer continue to advance, investigating the automated segmentation performance with multimodal imaging technologies such as PET/CT or PET/MRI also represents a promising avenue [16,46].

Table 4 ICCs for assessment of task difficulty: the ICC between expert readers, the ICC between expert readers and deep learning model, and the ICC mentioned by previous literature [24]
Two objects | Post-nCRT ICC
Two human radiologists (T2WI) | 0.739 (95% CI: 0.515, 0.865)
DL model and human radiologists (T2WI) | 0.669 (95% CI: 0.612, 0.719)
Two human radiologists [24] (DWI) | 0.750 (95% CI: 0.630, 0.830)
Automated segmentation using the software [24] (DWI) | 0.530-0.660
Semiautomated segmentation using the software [24] (DWI) | 0.610-0.750

Conclusions
In this study, we developed an automatic segmentation pipeline for LARC exclusively based on post-nCRT T2-weighted MRI. It was the first attempt to evaluate and validate the application potential of the nnUNet framework for rectal cancer on post-nCRT MRI, differing from the CT slices used in previous studies. The experimental results indicated a relatively high accuracy (DSC, HD95, and MSD). Moreover, the robustness of the network was also verified by analyzing the segmented tumor lesions across different TRGs. With further multicentre external validation, the model is expected to be not only an auxiliary tool for manual labeling but also a potential practical tool for subsequent tumor measurement, radiomics analysis, treatment planning, and operative strategy. Future studies will focus on exploring effective methods to combine 2D models with 3D models and further apply them to clinical populations.

Fig. 1 Flowchart showing the inclusion criteria for patients and the process of the overall dataset

Fig. 5 Learning curves of fivefold cross-validation: (1) a-e Loss graphs and evaluation metrics (DSC) from first fold to fifth fold. (2) Left axis: changes in losses on training and validation dataset from 0 to 499 epochs. (3) Right axis: DSC values on the validation dataset from 0 to 499 epochs

Fig. 6 Examples of comparison between segmentation results of the DL model and the annotations from radiologists: a-g original images; h-n ground truth; and o-u prediction results. The red areas represent ROIs

Table 1 Clinical records of patients with LARC

Table 3 Evaluation metrics (DSC, HD95, and MSD) for the test dataset on different TRGs